Chapter 9 Constructions and Idioms

9.1 Collostruction

In this chapter, I would like to talk about the relationship between constructions and words. Words may co-occur to form collocation patterns. When words co-occur with a particular morphosyntactic pattern, they form collostruction patterns.

Here I would like to introduce a widely applied method for research on the meanings of constructional schemas: Collostructional Analysis (Stefanowitsch and Gries 2003). This is the major framework in corpus linguistics for the study of the relationship between words and constructions.

The idea behind collostructional analysis is simple: the meaning of a morphosyntactic construction can very often be determined from the words that co-occur with it.

In particular, words that are strongly associated (i.e., co-occurring) with the construction are referred to as collexemes of the construction.

Collostructional Analysis is an umbrella term that covers several sub-analyses for constructional semantics:

  • collexeme analysis
  • co-varying collexeme analysis
  • distinctive collexeme analysis

This chapter will focus on the first one, collexeme analysis, whose principles can be extended to the other analyses.

Also, I will demonstrate how we can conduct a collexeme analysis by using the R script written by Stefan Gries (Collostructional Analysis).

9.2 Corpus

I will use the Apple News Corpus from Chapter 8 as our corpus.

In this demonstration, I would like to look at a particular morphosyntactic frame in Chinese, X + 起來. Our goal is simple: to understand the semantics of this constructional schema, it is very informative to find out which words are strongly associated with its X slot.

So our first step is to load the corpus into R.

9.3 Word Segmentation

Because the Apple News Corpus is a raw-text corpus, we first need to segment it into words.

9.4 Extract Constructions

With the word boundary information, we can now extract our target patterns from the corpus using regular expressions.
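As a minimal sketch of this extraction step, the base R code below works on a tiny, made-up segmented text in which words are separated by single spaces (an assumption about the segmenter's output; the real corpus and its delimiter may differ). It matches any token followed by 起來 and keeps the word in the X slot.

```r
# Hypothetical word-segmented sentences: tokens are separated by single spaces
seg_text <- c("他 看 起來 很 高興",
              "我 想 起來 了 ， 天氣 熱 起來 了")

# One token (a run of non-space characters), then 起來 as a whole token
pattern <- "[^ ]+ 起來(?= |$)"
hits <- unlist(regmatches(seg_text, gregexpr(pattern, seg_text, perl = TRUE)))

# Drop the construction marker to keep only the word in the X slot
x_slot <- sub(" 起來$", "", hits)

# Joint frequency of each word with the construction
joint_freq <- sort(table(x_slot), decreasing = TRUE)
print(joint_freq)
```

The lookahead `(?= |$)` is a design choice to avoid matching longer tokens that merely begin with 起來.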

9.5 Distributional Information Needed for CA

To perform the collostructional analysis, which is essentially a statistical analysis of the association between words and constructions, we need to collect the necessary distributional information.

Also, to use Stefan Gries’ R script of Collostructional Analysis, we need the following information:

  1. Joint Frequencies of Words and Constructions
  2. Frequencies of Words in Corpus
  3. Corpus Size (total number of words in corpus)
  4. Construction Size (total number of constructions in corpus)
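To see why exactly these four numbers are needed, note that together they determine a 2 x 2 contingency table for each word. The sketch below builds such a table and runs a one-tailed Fisher exact test in base R. The corpus size and construction size are the figures reported in this chapter; the word frequency and joint frequency are hypothetical values chosen for illustration. Reporting the result as -log10(p) is a common convention and is assumed here; it is not taken from the script itself.

```r
# Figures reported in this chapter
corpus_size       <- 3209617
construction_size <- 546

# Hypothetical figures for one candidate word (not taken from the corpus)
word_freq  <- 1000   # frequency of the word in the whole corpus
joint_freq <- 50     # frequency of the word in the X slot

# Cells of the 2 x 2 table
a  <- joint_freq                       # word in the construction
b  <- construction_size - joint_freq   # other words in the construction
c_ <- word_freq - joint_freq           # word outside the construction
d  <- corpus_size - a - b - c_         # other words outside the construction

tab <- matrix(c(a, b, c_, d), nrow = 2, byrow = TRUE,
              dimnames = list(c("construction", "elsewhere"),
                              c("word", "other_words")))

# One-tailed test: is the word attracted to the construction?
p_value <- fisher.test(tab, alternative = "greater")$p.value
coll_strength <- -log10(p_value)   # assumed convention for reporting strength
print(tab)
print(coll_strength)
```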

9.5.1 Word Frequency List

9.5.3 Other Information

We prepare necessary distributional information for the later collostructional analysis.

## Corpus Size:  3209617
## Construction Size:  546

9.5.4 Create Output File

This is to create an empty output txt file to keep the results from the Collostructional Analysis script.
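A minimal sketch of this preparation step in base R is given below. The three-column layout of the input table (word, frequency in corpus, frequency in construction) and the column names are assumptions about what the script expects, so please check them against the script's own instructions. The frequencies are hypothetical, and the files are written to a temporary directory rather than to the working directory.

```r
# Hypothetical frequencies for three words in the X slot (illustration only)
qilai_table <- data.frame(
  WORD                 = c("看", "想", "站"),
  FREQ_IN_CORPUS       = c(1200, 900, 300),
  FREQ_IN_CONSTRUCTION = c(60, 45, 20)
)

out_dir     <- tempdir()
input_path  <- file.path(out_dir, "qilai.tsv")
output_path <- file.path(out_dir, "qilai_results.txt")

# Tab-delimited input file for the script (assumed layout)
write.table(qilai_table, input_path, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Empty text file that will later hold the results
file.create(output_path)
```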

9.5.5 Run coll.analysis.r

Finally, we are ready to perform the collostructional analysis using Stefan Gries’ coll.analysis.r.

This is an R script with interactive instructions. When you run it, you will be prompted with a series of questions, and you need to enter the required information in response.

Specifically, data to be entered include:

  • analysis to perform: 1
  • name of construction: QILAI
  • corpus size: 3209617
  • freq of constructions: 546
  • index of association strength: 1 (=fisher-exact)
  • sorting: 4 (=collostruction strength)
  • decimals: 2
  • text file with the raw data: <qilai.tsv>
  • Where to save output: 1 (= text file)
  • output file: <qilai_results.txt>

The output of coll.analysis.r is as shown below:


The output from coll.analysis.r is a text file containing both the result data frame (i.e., the data frame with all the statistics) and detailed explanations provided by Stefan Gries.

We can also extract the result data frame from the text file. The output file from the collexeme analysis of QILAI is available in demo_data/qilai_results.txt.

  • We first load the result txt file like a normal text file using readLines()
  • We extract the lines which include the statistics and parse them into a data frame using read_tsv()
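The two steps can be sketched as follows. Because the exact layout of the real output file is not reproduced here, the code writes a small mock file whose structure (free-text notes around a tab-delimited table) is purely an assumption, and it uses base R's read.delim() in place of read_tsv() so that no extra packages are needed. All values in the mock file are invented.

```r
# Mock results file (assumed layout; values are invented for illustration)
mock_lines <- c(
  "Explanatory notes (mock text, not the script's real wording)",
  "",
  "words\tobs.freq\tcoll.strength",
  "wordA\t60\t85.2",
  "wordB\t45\t60.1",
  "",
  "More explanatory notes"
)
results_path <- tempfile(fileext = ".txt")
writeLines(mock_lines, results_path)

# Step 1: load the file as plain text
lines <- readLines(results_path)

# Step 2: keep only the tab-delimited lines and parse them as a table
stat_lines <- lines[grepl("\t", lines, fixed = TRUE)]
results <- read.delim(text = paste(stat_lines, collapse = "\n"))

# Rank by one association metric and keep the top N
top_n <- head(results[order(-results$coll.strength), ], 10)
print(top_n)
```

With the real output file, the filter in step 2 would need to be adapted to whatever reliably distinguishes the statistics from the explanatory text.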

With the collexeme analysis statistics, we can therefore explore the top N collexemes according to specific association metrics.

The bar plots above show the top 10 collexemes based on four different metrics: obs.freq, delta.p.contr.to.word, delta.p.word.to.contr, and coll.strength.
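For readers unfamiliar with the two delta-P metrics, the sketch below computes them by hand from a 2 x 2 table, using the standard definition of delta P as a difference between two conditional probabilities. The corpus and construction sizes come from this chapter; the word frequency and joint frequency are hypothetical. The mapping of these two values onto the script's column names is my assumption.

```r
# Cells of the 2 x 2 table (word/joint frequencies are hypothetical)
a  <- 50                            # word in the construction
b  <- 546 - 50                      # other words in the construction
c_ <- 1000 - 50                     # word outside the construction
d  <- 3209617 - a - b - c_          # other words outside the construction

# Construction as cue, word as outcome:
# P(word | construction) - P(word | not construction)
delta_p_constr_to_word <- a / (a + b) - c_ / (c_ + d)

# Word as cue, construction as outcome:
# P(construction | word) - P(construction | not word)
delta_p_word_to_constr <- a / (a + c_) - b / (b + d)

print(c(delta_p_constr_to_word, delta_p_word_to_constr))
```

Unlike the Fisher-based strength, delta P is directional, which is why the two values can differ considerably for the same word.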


Many studies have shown that Chinese makes use of a large proportion of four-character idioms in discourse. This section provides an exploratory analysis of four-character idioms in Chinese.

In our demo_data directory, there is a file dict-ch-idiom.txt, which includes a list of four-character idioms in Chinese. These idioms were collected from 搜狗輸入法詞庫 (the Sogou input method lexicon); the original files (.scel) have been combined, deduplicated, and converted to a more machine-readable format, i.e., .txt.

You can load the dataset in R for exploration of idioms.
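A minimal loading sketch in base R follows. To keep it runnable without the course data, it reads from a small temporary stand-in file; the one-idiom-per-line layout is an assumption about dict-ch-idiom.txt, and with the real file you would point readLines() at demo_data/dict-ch-idiom.txt instead.

```r
# Stand-in for demo_data/dict-ch-idiom.txt (assumed: one idiom per line)
idiom_file <- tempfile(fileext = ".txt")
writeLines(c("一心一意", "民脂民膏", "滿坑滿谷", "一心一意"),
           idiom_file, useBytes = TRUE)

# Read as UTF-8 so that characters, not bytes, are counted below
idioms <- readLines(idiom_file, encoding = "UTF-8")

# Keep unique entries that are exactly four characters long
idioms <- unique(idioms[nchar(idioms) == 4])
print(idioms)
```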


9.6 Exercises

The following exercises use the dataset Yet Another Chinese News Dataset from Kaggle. The dataset is available on our dropbox demo_data/corpus-news-collection.csv.

The dataset is a collection of news articles in Traditional and Simplified Chinese, including some Internet news outlets that are NOT Chinese state media.


Exercise 9.1 Please conduct a collostructional analysis for the aspectual construction “X + 了” in Chinese. Extract all tokens of this construction from the news corpus and identify all words preceding the aspectual marker.

Based on the distributional information, conduct the collexeme analysis using coll.analysis.r and present the collexemes that significantly co-occur with the construction “X + 了” in the X slot. Rank the collexemes according to the collostruction strength provided by Stefan Gries’ script.
##    user  system elapsed 
##  32.227   2.023  34.255
  • Corpus Size: 7885435
  • Construction Size: 25949

  • The output of the Collexeme Analysis (coll.analysis.r)


Exercise 9.2 Please load the Chinese News dataset—demo_data/corpus-news-collection.csv—in R, tokenize the entire corpus into words, and create a frequency list of all four-character words/idioms included in the list demo_data/dict-ch-idiom.txt.

Please include both the frequency as well as the dispersion of each four-character idiom (Dispersion is defined as the number of articles where it is observed.)
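To make the difference between frequency and dispersion concrete, here is a toy illustration in base R. The data frame is made up (one row per idiom occurrence, with a hypothetical doc_id column for the article it came from); it is not a solution for the real dataset.

```r
# Hypothetical idiom tokens: one row per occurrence
idiom_tokens <- data.frame(
  doc_id = c(1, 1, 1, 2, 3, 3),
  idiom  = c("一心一意", "一心一意", "不知不覺",
             "不知不覺", "不知不覺", "滿坑滿谷")
)

# Frequency: total number of occurrences of each idiom
freq <- table(idiom_tokens$idiom)

# Dispersion: number of distinct articles in which each idiom occurs
dispersion <- tapply(idiom_tokens$doc_id, idiom_tokens$idiom,
                     function(d) length(unique(d)))

idiom_stats <- data.frame(
  idiom      = names(freq),
  freq       = as.vector(freq),
  dispersion = as.vector(dispersion[names(freq)])
)

# Arrange by dispersion, highest first
idiom_stats <- idiom_stats[order(-idiom_stats$dispersion), ]
print(idiom_stats)
```

In this toy data, 一心一意 occurs twice but only in one article, so its frequency is 2 while its dispersion is 1.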

Please arrange the four-character idioms according to their dispersion.
##    user  system elapsed 
##  16.087   0.285  16.375
## [1] 8135709

Exercise 9.3 Let’s assume that we are interested in the idioms of the schema of X_X_, such as “一心一意”, “民脂民膏”, “滿坑滿谷” (i.e., idioms where the first character is the same as the third character).

Please find the top 20 idioms of this schema and visualize their frequencies in a bar plot as shown below.

Exercise 9.4 Continuing the previous exercise, the idioms of the schema X_X_ may have different types of X. Here we refer to the character X as the pivot of the idiom.

Please identify all the pivots for idioms of this schema which have at least two types of constructional variants in the corpus (i.e., its type frequency >= 2) and visualize their type frequencies as shown below.

For example, the type frequency of the pivot schema “不_不_” is 18 in the corpus, including constructional variants such as “不知不覺”, “不折不扣”, “不倫不類”, “不聞不問”, etc.

Exercise 9.5 Continuing the previous exercise, to further study the semantic uniqueness of each pivot schema, please identify the top 5 idioms of each pivot schema according to the frequencies of the idioms in the corpus.

Please present the results for schemas whose type frequencies >= 5 (i.e., the pivot schema has at least FIVE different idioms as its constructional instances).

Please visualize your results as shown below.

Exercise 9.6 Let’s assume that we are interested in how different media may use the four-character words differently.

Please show the average number of idioms per article by different media and visualize the results in bar plots as shown below.

The average number of idioms per article can be computed based on token frequency (i.e., on average how many idioms were observed per article?) or type frequency (i.e., on average how many different idiom types were observed per article?).

For example, there are 2,557 tokens (1,390 types) of idioms in the 1,953 articles published by “NewsLens”. The average token frequency of idiom uses would be: 2557/1953 = 1.31; the average type frequency of idiom uses would be: 1390/1953 = 0.71.
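The same arithmetic can be checked in R. The three counts are the ones given in the example; the variable names are only illustrative.

```r
# Counts for "NewsLens" as given in the example above
n_articles <- 1953
n_tokens   <- 2557   # idiom tokens
n_types    <- 1390   # idiom types

# Average idiom uses per article, rounded to two decimals
avg_token_freq <- round(n_tokens / n_articles, 2)
avg_type_freq  <- round(n_types / n_articles, 2)
print(c(token = avg_token_freq, type = avg_type_freq))
```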

References

Stefanowitsch, Anatol, and Stefan Th. Gries. 2003. “Collostructions: Investigating the Interaction of Words and Constructions.” International Journal of Corpus Linguistics 8 (2). John Benjamins: 209–43.